Limit threads usage in numpy during test to avoid time-out #4584

yuxuanzhuang · 2024-05-01T05:35:00Z

Changes made in this Pull Request:

make numpy only use one thread during test
reduce pytest worker
make tests lighter

Similar to #2950, allowing numpy to make full usages of the resources clogs the machine and likely leads to the ~~occassional~~ frequent time-out in multiprocessing-related testings.

See the comparison between duration below; it takes a lot more time in test_encore for numpy calculations before.

new test duration

============================= slowest 50 durations =============================
11.82s call     MDAnalysisTests/analysis/test_contacts.py::TestContacts::test_n_initial_contacts[datafiles1-41814]
10.93s call     MDAnalysisTests/coordinates/test_gro.py::TestGROLargeWriter::test_writer_large
10.70s call     MDAnalysisTests/coordinates/test_pdb.py::test_conect_bonds_all
9.34s call     MDAnalysisTests/parallelism/test_multiprocessing.py::test_multiprocess_COG[u11]
9.02s call     MDAnalysisTests/utils/test_pickleio.py::test_fileio_pickle
8.53s call     MDAnalysisTests/analysis/test_gnm.py::test_closeContactGNMAnalysis
8.31s call     MDAnalysisTests/analysis/test_gnm.py::test_closeContactGNMAnalysis_weights_None
8.27s call     MDAnalysisTests/analysis/test_encore.py::TestEncore::test_ces_error_estimation_ensemble_bootstrap
7.62s call     MDAnalysisTests/coordinates/test_gro.py::TestGROLargeWriter::test_write_trajectory_universe
7.52s call     MDAnalysisTests/coordinates/test_gro.py::TestGROLargeWriter::test_write_trajectory_atomgroup

old test duration (https://github.com/MDAnalysis/mdanalysis/actions/runs/8869719078/job/24350703650)

20.06s call     MDAnalysisTests/analysis/test_encore.py::TestEncore::test_hes_to_self
18.93s call     MDAnalysisTests/analysis/test_encore.py::TestEncore::test_hes
13.07s call     MDAnalysisTests/analysis/test_encore.py::TestEncore::test_ces_error_estimation_ensemble_bootstrap
12.93s call     MDAnalysisTests/analysis/test_encore.py::TestEncore::test_hes_custom_weights
12.80s call     MDAnalysisTests/coordinates/test_pdb.py::test_conect_bonds_all
12.12s call     MDAnalysisTests/analysis/test_contacts.py::TestContacts::test_n_initial_contacts[datafiles1-41814]
11.40s call     MDAnalysisTests/analysis/test_gnm.py::test_closeContactGNMAnalysis
10.79s call     MDAnalysisTests/coordinates/test_gro.py::TestGROLargeWriter::test_writer_large

gonna run it three times.

pass once

pass twice

PR Checklist

Tests?
Docs?
CHANGELOG updated?
Issue raised/referenced?

Developers certificate of origin

I certify that this contribution is covered by the LGPLv2.1+ license as defined in our LICENSE and adheres to the Developer Certificate of Origin.

📚 Documentation preview 📚: https://mdanalysis--4584.org.readthedocs.build/en/4584/

github-actions · 2024-05-01T05:37:11Z

Linter Bot Results:

Hi @yuxuanzhuang! Thanks for making this PR. We linted your code and found the following:

Some issues were found with the formatting of your code.

Code Location	Outcome
main package	✅ Passed
testsuite	⚠️ Possible failure

Please have a look at the darker-main-code and darker-test-code steps here for more details: https://github.com/MDAnalysis/mdanalysis/actions/runs/10782714307/job/29903235743

Please note: The black linter is purely informational, you can safely ignore these outcomes if there are no flake8 failures!

codecov · 2024-05-01T05:59:59Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 93.84%. Comparing base (b3208b3) to head (555520a).

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #4584      +/-   ##
===========================================
- Coverage    93.87%   93.84%   -0.04%     
===========================================
  Files          173      185      +12     
  Lines        21428    22494    +1066     
  Branches      3980     3980              
===========================================
+ Hits         20116    21109     +993     
- Misses         858      931      +73     
  Partials       454      454

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

pep8speaks · 2024-05-01T06:36:58Z

Hello @yuxuanzhuang! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

In the file testsuite/MDAnalysisTests/analysis/test_encore.py:

Line 128:53: E231 missing whitespace after ','

Comment last updated at 2024-09-09 23:10:45 UTC

yuxuanzhuang · 2024-05-03T16:20:10Z

@orbeckst @IAlibay I now believe that these three adjustments can reduce the chance of timeouts:

Limiting the number of threads used by numpy during testing. This also decreases the duration of tests that are heavily dependent on numpy and computationally intensive, e.g. in encore.
Reducing the number of workers in pytest to prevent overloading the entire test runner.
Lightening the tests. I have already modified some, but we can certainly further optimize the other time-consuming tests.

I have re-run the GitHub CI three times consecutively, and they all pass. I have also patched it to #4162 that are more likely to time-out in yuxuanzhuang#6 and it also seems to work well.

orbeckst

Thank you for looking into this pesky issues. If your changes improve our CI then I am all in favor.

I have one comment on how to set the number of used processes for pytest-xdist (see comments) but I am not going to block over it.

I am also shocked at how insensitive the ENCORE tests are: you changed them dramatically and they still pass without changing the reference values. Maybe a thing to raise on the mdaencore repo....

orbeckst · 2024-05-03T21:45:44Z

.github/workflows/gh-ci.yaml

+        export OMP_NUM_THREADS=1
+        export MKL_NUM_THREADS=1
+        # limit to 2 workers to avoid overloading the CI
+        export PYTEST_XDIST_AUTO_NUM_WORKERS=2


Could this be set to auto, because that's what we are already using. Does auto not work correctly?

If auto does not work then I'd prefer we define a variable GITHUB_CI_MAXCORES or similar and then use it invoking pytest -n $GITHUB_CI_MAXCORES. I prefer commandline arguments over env vars when determining code behavior because you immediately see what affects the command itself.

I was planning to find a way to find the default number of workers pytest will use and reduce it by one because the Ubuntu runner has 4 cores and Mac has 3 cores. but I ended up setting it to 2 and found the performance acceptable.

Ubuntu runner has 4 cores and Mac has 3 cores.

It's unclear from this conversation that altering the value of -n is actually resulting in any difference in the issue we're seeing with the the one multiprocessing test failing (i.e. the only cause of timeouts), could you confirm this please?

If it's not affecting things, then my suggestion is to stick with auto unless there's a substantial difference in performance that we've missed.

orbeckst · 2024-05-03T21:47:49Z

azure-pipelines.yml

+      # limit to 2 workers to avoid overloading the CI
+      $env:PYTEST_XDIST_AUTO_NUM_WORKERS=1


The comment is confusing: if you set workers to 1 but say you limit to 2 workers. Change comment, otherwise same as above: perhaps just add to commandline args?

I forgot to correct the comment line when I realized Azure has only 2 cores so I changed # workers to 1.

As above, is this actually affecting timeouts or is this some separate optimization?

azure-pipelines.yml

testsuite/MDAnalysisTests/analysis/test_encore.py

orbeckst · 2024-05-03T22:00:58Z

@IAlibay would you please shepherd the PR to completion?

IAlibay

Thanks for doing this @yuxuanzhuang

I'm unfortunately pretty swamped this weekend so I can only very briefly look at this today but I'll try to get back to you asap with a longer review and better explanation of where I think those env exports should go so they apply globally.

IAlibay · 2024-05-03T22:13:06Z

.github/workflows/gh-ci.yaml

@@ -108,6 +108,13 @@ jobs:
    - name: run_tests
      if: contains(matrix.name, 'asv_check') != true
      run: |
+        export OPENBLAS_NUM_THREADS=1


todo: there's another place to put this so it applies everywhere

hmacdope · 2024-05-27T06:51:11Z

@IAlibay with my PR shepherding hat on, just pinging here, no rush though 😄

hmacdope · 2024-07-20T02:44:29Z

@IAlibay just pinging here with my review coordination hat on.

IAlibay

Based on the PR main comment and the follow-up comment, it's unclear to me how this PR specifically fixes #4209

The timeout issues are specifically related to the multiprocessing test going over the time limit, whilst the optimizations to the encore tests are nice, could you please detail what is being done here that specifically changes the behaviour of the multiprocessing test?

yuxuanzhuang · 2024-09-09T23:15:47Z

Since this PR doesn’t really resolve the time-out issue, I will repurpose it to focus on reducing the test duration.

yuxuanzhuang marked this pull request as draft May 1, 2024 05:35

yuxuanzhuang force-pushed the fix_timeout branch from c75337c to 320fca1 Compare May 1, 2024 06:34

github-actions bot added the Component-Analysis label May 1, 2024

limit numpy thread

97df4a4

yuxuanzhuang force-pushed the fix_timeout branch from ebda515 to 97df4a4 Compare May 1, 2024 07:07

github-actions bot removed the Component-Analysis label May 1, 2024

yuxuanzhuang added 2 commits May 1, 2024 09:33

limit numpy thread with prefix

21435cf

export instead

12adcb1

yuxuanzhuang changed the title ~~Test to fix test time-out~~ Limit threads usage in numpy during test to avoid time-out May 1, 2024

yuxuanzhuang marked this pull request as ready for review May 1, 2024 15:10

yuxuanzhuang marked this pull request as draft May 1, 2024 16:15

yuxuanzhuang added 2 commits May 2, 2024 14:58

reduce computation in encore

06f8d34

limit thread in azure pipeline

33ba400

github-actions bot added the Continuous Integration label May 2, 2024

yuxuanzhuang added 3 commits May 2, 2024 22:55

use 2 nworkers

757d773

limit azure to 1 core

e3192e7

reduce parallel test time

747bfb8

yuxuanzhuang marked this pull request as ready for review May 3, 2024 16:20

orbeckst approved these changes May 3, 2024

View reviewed changes

orbeckst assigned IAlibay May 3, 2024

IAlibay requested changes May 3, 2024

View reviewed changes

IAlibay self-requested a review May 27, 2024 07:16

RMeli mentioned this pull request Jun 17, 2024

[GSoC] Parallelisation of AnalysisBase with multiprocessing and dask #4162

Merged

4 tasks

Merge remote-tracking branch 'upstream/develop' into fix_timeout

e46b809

github-actions bot removed the Continuous Integration label Aug 16, 2024

orbeckst mentioned this pull request Aug 25, 2024

[CI] azure runners are flaky — remove them? #4686

Closed

IAlibay requested changes Aug 25, 2024

View reviewed changes

yuxuanzhuang added 2 commits September 9, 2024 15:56

Merge remote-tracking branch 'upstream/develop' into fix_timeout

754ff2c

restore gh-ci

555520a

yuxuanzhuang marked this pull request as draft September 9, 2024 23:15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Limit threads usage in numpy during test to avoid time-out #4584

Limit threads usage in numpy during test to avoid time-out #4584

yuxuanzhuang commented May 1, 2024 •

edited

Loading

github-actions bot commented May 1, 2024 •

edited

Loading

codecov bot commented May 1, 2024 •

edited

Loading

pep8speaks commented May 1, 2024 •

edited

Loading

yuxuanzhuang commented May 3, 2024

orbeckst left a comment

orbeckst May 3, 2024

yuxuanzhuang May 3, 2024

orbeckst Aug 23, 2024

IAlibay Aug 25, 2024

orbeckst May 3, 2024

yuxuanzhuang May 3, 2024

IAlibay Aug 25, 2024

orbeckst commented May 3, 2024

IAlibay left a comment

IAlibay May 3, 2024

hmacdope commented May 27, 2024

hmacdope commented Jul 20, 2024

IAlibay left a comment

yuxuanzhuang commented Sep 9, 2024

		# limit to 2 workers to avoid overloading the CI
		$env:PYTEST_XDIST_AUTO_NUM_WORKERS=1

Limit threads usage in numpy during test to avoid time-out #4584

Are you sure you want to change the base?

Limit threads usage in numpy during test to avoid time-out #4584

Conversation

yuxuanzhuang commented May 1, 2024 • edited Loading

PR Checklist

Developers certificate of origin

github-actions bot commented May 1, 2024 • edited Loading

Linter Bot Results:

codecov bot commented May 1, 2024 • edited Loading

Codecov Report

pep8speaks commented May 1, 2024 • edited Loading

Comment last updated at 2024-09-09 23:10:45 UTC

yuxuanzhuang commented May 3, 2024

orbeckst left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

orbeckst commented May 3, 2024

IAlibay left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hmacdope commented May 27, 2024

hmacdope commented Jul 20, 2024

IAlibay left a comment

Choose a reason for hiding this comment

yuxuanzhuang commented Sep 9, 2024

yuxuanzhuang commented May 1, 2024 •

edited

Loading

github-actions bot commented May 1, 2024 •

edited

Loading

codecov bot commented May 1, 2024 •

edited

Loading

pep8speaks commented May 1, 2024 •

edited

Loading